library(tidyverse)
library(rbokeh)
# devtools::install_github("hafen/trelliscopejs")
library(trelliscopejs)

Problem Description

Here’s the problem description from D3M:

Radon is a carcinogen – a naturally occurring radioactive gas whose decay products are also radioactive – known to cause lung cancer in high concentrations and estimated to cause several thousand lung cancer deaths per year in the United States. The distribution of radon levels in U.S. homes varies greatly, with some houses having dangerously high concentrations. To identify areas of high radon exposure, the Environmental Protection Agency coordinated radon measurements in a random sample of houses throughout the country.

The overall goal is to estimate the distribution of radon levels across U.S. counties, so that homeowners could make decisions about measuring or remediating the radon in their houses based on the best available knowledge of local conditions. For the purpose of this problem, we had an important predictor - whether the measurement was taken in a basement. (Radon comes from underground and can enter more easily when a house is built into the ground.) We also had an important county-level predictor - a measurement of soil uranium that was available at the county level. In this problem we’ll look at Minnesota, a state that contains 85 county’s in which different measurements are taken, ranging from 2 till 80 measurements per county. Build a model to make predictions of radon levels based on the county data and the presence of a basement.

Data

There are two sources of data. One is from Andrew Gelman’s hierarchical modeling book and the other from D3M. Gelman’s data has results across the United States, while the D3M data is just for Minnesota.

# from Gelman
radon <- read_csv("data/ARM/srrs2.dat", na = c(".", ""))
radon
## # A tibble: 12,777 x 25
##    idnum state state2 stfips   zip region typebldg floor  room basement
##    <int> <chr>  <chr>  <int> <int>  <int>    <int> <int> <int>    <chr>
##  1     1    AZ     AZ      4 85920      1        1     1     2        N
##  2     2    AZ     AZ      4 85920      1        0     9     0     <NA>
##  3     3    AZ     AZ      4 85924      1        1     1     3        N
##  4     4    AZ     AZ      4 85925      1        1     1     3        N
##  5     5    AZ     AZ      4 85932      1        1     1     1        N
##  6     6    AZ     AZ      4 85936      1        1     1     1        N
##  7     7    AZ     AZ      4 85936      1        0     9     0     <NA>
##  8     8    AZ     AZ      4 85936      1        1     1     1        N
##  9     9    AZ     AZ      4 85938      1        0     9     0     <NA>
## 10    10    AZ     AZ      4 85938      1        1     1     3     <NA>
## # ... with 12,767 more rows, and 15 more variables: windoor <chr>,
## #   rep <int>, stratum <int>, wave <int>, starttm <chr>, stoptm <chr>,
## #   startdt <chr>, stopdt <chr>, activity <dbl>, pcterr <dbl>,
## #   adjwt <dbl>, dupflag <int>, zipflag <int>, cntyfips <int>,
## #   county <chr>
# from D3M
radon_mn <- read_csv("data/r_26/raw_data/radon.csv")
radon_mn
## # A tibble: 919 x 28
##    radonFile_index state state2 stfips   zip region typebldg floor  room
##              <int> <chr>  <chr>  <dbl> <int>  <dbl>    <dbl> <dbl> <dbl>
##  1            5081    MN     MN     27 55735      5        1     1     3
##  2            5082    MN     MN     27 55748      5        1     0     4
##  3            5083    MN     MN     27 55748      5        1     0     4
##  4            5084    MN     MN     27 56469      5        1     0     4
##  5            5085    MN     MN     27 55011      3        1     0     4
##  6            5086    MN     MN     27 55014      3        1     0     4
##  7            5087    MN     MN     27 55014      3        1     0     4
##  8            5088    MN     MN     27 55014      3        1     0     2
##  9            5089    MN     MN     27 55014      3        1     0     4
## 10            5090    MN     MN     27 55014      3        1     0     4
## # ... with 909 more rows, and 19 more variables: basement <chr>,
## #   windoor <chr>, rep <int>, stratum <dbl>, wave <int>, starttm <dbl>,
## #   stoptm <dbl>, startdt <dbl>, stopdt <dbl>, activity <dbl>,
## #   pcterr <dbl>, adjwt <dbl>, dupflag <dbl>, zipflag <dbl>,
## #   cntyfips <dbl>, county <chr>, fips <dbl>, Uppm <dbl>,
## #   county_code <int>
# how do these datasets differ in terms of variables?
setdiff(names(radon), names(radon_mn))
## [1] "idnum"
setdiff(names(radon_mn), names(radon))
## [1] "radonFile_index" "fips"            "Uppm"            "county_code"

I haven’t found a data dictionary for these. The D3M schema doesn’t provide any information about the variables.

Here’s a summary of each of the variables:

summary(radon)
##      idnum          state              state2              stfips     
##  Min.   :    1   Length:12777       Length:12777       Min.   : 4.00  
##  1st Qu.: 3195   Class :character   Class :character   1st Qu.:18.00  
##  Median : 6389   Mode  :character   Mode  :character   Median :27.00  
##  Mean   : 6389                                         Mean   :28.14  
##  3rd Qu.: 9583                                         3rd Qu.:38.00  
##  Max.   :12777                                         Max.   :55.00  
##                                                                       
##       zip            region          typebldg         floor      
##  Min.   : 1001   Min.   : 1.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:17756   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:0.000  
##  Median :54155   Median : 3.000   Median :1.000   Median :0.000  
##  Mean   :45125   Mean   : 3.768   Mean   :1.038   Mean   :0.522  
##  3rd Qu.:63090   3rd Qu.: 5.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :86512   Max.   :11.000   Max.   :5.000   Max.   :9.000  
##  NA's   :1                                                       
##       room         basement           windoor               rep       
##  Min.   :0.000   Length:12777       Length:12777       Min.   :1.000  
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.:2.000  
##  Median :3.000   Mode  :character   Mode  :character   Median :3.000  
##  Mean   :3.026                                         Mean   :3.009  
##  3rd Qu.:4.000                                         3rd Qu.:4.000  
##  Max.   :7.000                                         Max.   :5.000  
##                                                        NA's   :934    
##     stratum            wave          starttm             stoptm         
##  Min.   : 1.000   Min.   :  1.00   Length:12777       Length:12777      
##  1st Qu.: 2.000   1st Qu.: 34.00   Class :character   Class :character  
##  Median : 2.000   Median : 56.00   Mode  :character   Mode  :character  
##  Mean   : 3.415   Mean   : 59.09                                        
##  3rd Qu.: 4.000   3rd Qu.: 82.00                                        
##  Max.   :29.000   Max.   :140.00                                        
##                   NA's   :934                                           
##    startdt             stopdt             activity          pcterr      
##  Length:12777       Length:12777       Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  1.10   1st Qu.:  5.20  
##  Mode  :character   Mode  :character   Median :  2.30   Median :  9.90  
##                                        Mean   :  4.48   Mean   : 13.46  
##                                        3rd Qu.:  4.80   3rd Qu.: 18.40  
##                                        Max.   :273.50   Max.   :490.40  
##                                                                         
##      adjwt             dupflag           zipflag             cntyfips     
##  Min.   :   2.082   Min.   :0.00000   Min.   :0.0000000   Min.   :  1.00  
##  1st Qu.: 218.513   1st Qu.:0.00000   1st Qu.:0.0000000   1st Qu.: 17.00  
##  Median : 443.844   Median :0.00000   Median :0.0000000   Median : 47.00  
##  Mean   : 540.962   Mean   :0.03921   Mean   :0.0008609   Mean   : 81.78  
##  3rd Qu.: 675.416   3rd Qu.:0.00000   3rd Qu.:0.0000000   3rd Qu.: 97.00  
##  Max.   :2364.763   Max.   :2.00000   Max.   :1.0000000   Max.   :999.00  
##                                                                           
##     county         
##  Length:12777      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The data from Gelman also provides geographical coordinates for the counties:

geo <- read_csv("data/ARM/cty.dat")
geo
## # A tibble: 3,194 x 7
##    stfips ctfips    st      cty     lon    lat    Uppm
##     <chr>  <chr> <chr>    <chr>   <dbl>  <dbl>   <dbl>
##  1     01    001    AL  AUTAUGA -86.643 32.534 1.78331
##  2     01    003    AL  BALDWIN -87.750 30.661 1.38323
##  3     01    005    AL  BARBOUR -85.393 31.870 2.10105
##  4     01    007    AL     BIBB -87.126 32.998 1.67313
##  5     01    009    AL   BLOUNT -86.568 33.981 1.88501
##  6     01    011    AL  BULLOCK -85.716 32.100 2.46112
##  7     01    013    AL   BUTLER -86.680 31.752 1.91714
##  8     01    015    AL  CALHOUN -85.827 33.772 1.97532
##  9     01    017    AL CHAMBERS -85.392 32.914 1.75695
## 10     01    019    AL CHEROKEE -85.604 34.176 1.76798
## # ... with 3,184 more rows

Exploration

Response variable: activity

Let’s look at the response variable activity. From the summary above, we see that it is non-negative. A quick look at its distribution:

plot(sort(radon$activity))

It looks very heavy-tailed so we probably want to look into a log transformation. But there are zero-valued observations. Also, are there any missing values?

length(which(is.na(radon$activity)))
## [1] 0

No missing values. Let’s see how many zero-valued observations there are:

length(which(radon$activity == 0))
## [1] 90

Not many. What’s the lowest non-zero value we observe?

min(radon$activity[radon$activity > 0])
## [1] 0.1

Let’s create a log activity variable and deal with zeros by setting them to the lowest-observed non-zero value.

radon$log_activity <- log2(ifelse(radon$activity == 0, 0.1, radon$activity))

Now let’s look at the distribution of this variable:

plot(sort(radon$log_activity))

This looks better. It may even be normally distributed. Let’s check:

qqnorm(radon$log_activity)
qqline(radon$log_activity)

We’ll use this variable in our modeling.

Other covariates

Here are some exploratory visualizations of some of the other covariates.

Some of these variables, while they look like numeric, are treated better as categorical, so we’ll change their type before plotting:

radon$room <- as.character(radon$room)
radon$floor <- as.character(radon$floor)
radon$typebldg <- as.character(radon$typebldg)

wave

I don’t know what the wave variable is. It’s numeric so let’s look at its distribution.

figure(data = radon) %>% ly_quantile(wave)
figure(data = radon) %>% ly_hist(wave)

The distribution has two modes…

room

figure(data = radon) %>% ly_bar(room, hover = TRUE)

Strange that there aren’t many observed room values of 5 and 6 when there’s a significant amount of 7.

floor

figure(data = radon) %>% ly_bar(floor, hover = TRUE)

Why is there a floor 9 but nothing between 3 and 9?

typebldg

figure(data = radon) %>% ly_bar(typebldg, hover = TRUE)

I don’t know what this is, but 1 is the most prominent (maybe that’s residential?).

basement

figure(data = radon) %>% ly_bar(basement, hover = TRUE)

I wonder what “0” means…

Variable combinations

figure(data = radon) %>% ly_bar(room, color = basement, hover = TRUE)
figure(data = radon) %>% ly_bar(room, color = basement, hover = TRUE, position = "fill")
figure(data = radon) %>% ly_bar(typebldg, color = room, hover = TRUE, position = "fill")

Merging in geo

A companion dataset is provided with county-level soil uranium levels (ppm), labeled Uppm, and geographical coordinates for the counties. We want to include this in our dataset so we can use Uppm in our modeling.

Note: All of the work that goes into figuring out how to merge these two data frames, including figuring out what variables to join on, making their types commensurate, making sure the data frame to be joined has unique rows per joining variables, etc., is very ad-hoc yet a very important part of getting data ready for analysis. I’m particularly verbose in this section just to point out all the steps it takes, and even with this I’m glossing over some. I don’t think we can expect a D3M system to always have perfect data fed in, so it brings up a lot of things to think about in terms of how an interface for a non-technical user could help expose and correct these kind of issues…

Looking at the radon data and the geo data, it appears that there are two columns to join on, cntyfips and stfips, state and county fips codes. Note that the county fips is only unique within a state:

radon %>%
  group_by(cntyfips) %>%
  summarise(n_states = length(unique(stfips))) %>%
  arrange(-n_states)
## # A tibble: 117 x 2
##    cntyfips n_states
##       <int>    <int>
##  1        3        9
##  2       13        9
##  3        5        8
##  4        7        8
##  5        9        8
##  6       17        8
##  7       23        8
##  8        1        7
##  9       11        7
## 10       15        7
## # ... with 107 more rows

So county fips code 3 exists in 9 states, etc. So we need to use both to join.

Also, note that the geo dataset uses ctfips instead of cntyfips:

head(geo)
## # A tibble: 6 x 7
##   stfips ctfips    st     cty     lon    lat    Uppm
##    <chr>  <chr> <chr>   <chr>   <dbl>  <dbl>   <dbl>
## 1     01    001    AL AUTAUGA -86.643 32.534 1.78331
## 2     01    003    AL BALDWIN -87.750 30.661 1.38323
## 3     01    005    AL BARBOUR -85.393 31.870 2.10105
## 4     01    007    AL    BIBB -87.126 32.998 1.67313
## 5     01    009    AL  BLOUNT -86.568 33.981 1.88501
## 6     01    011    AL BULLOCK -85.716 32.100 2.46112

Also, in radon, cntyfips and stfips are integers, but they’re strings in geo, so we need to convert them to integers to be able to join:

geo$ctfips <- as.integer(geo$ctfips)
geo$stfips <- as.integer(geo$stfips)

Now we can try to join them:

radon_join <- left_join(radon, geo, by = c(cntyfips = "ctfips", stfips = "stfips"))
radon_join
## # A tibble: 13,143 x 31
##    idnum state state2 stfips   zip region typebldg floor  room basement
##    <int> <chr>  <chr>  <int> <int>  <int>    <chr> <chr> <chr>    <chr>
##  1     1    AZ     AZ      4 85920      1        1     1     2        N
##  2     2    AZ     AZ      4 85920      1        0     9     0     <NA>
##  3     3    AZ     AZ      4 85924      1        1     1     3        N
##  4     4    AZ     AZ      4 85925      1        1     1     3        N
##  5     5    AZ     AZ      4 85932      1        1     1     1        N
##  6     6    AZ     AZ      4 85936      1        1     1     1        N
##  7     7    AZ     AZ      4 85936      1        0     9     0     <NA>
##  8     8    AZ     AZ      4 85936      1        1     1     1        N
##  9     9    AZ     AZ      4 85938      1        0     9     0     <NA>
## 10    10    AZ     AZ      4 85938      1        1     1     3     <NA>
## # ... with 13,133 more rows, and 21 more variables: windoor <chr>,
## #   rep <int>, stratum <int>, wave <int>, starttm <chr>, stoptm <chr>,
## #   startdt <chr>, stopdt <chr>, activity <dbl>, pcterr <dbl>,
## #   adjwt <dbl>, dupflag <int>, zipflag <int>, cntyfips <int>,
## #   county <chr>, log_activity <dbl>, st <chr>, cty <chr>, lon <dbl>,
## #   lat <dbl>, Uppm <dbl>

There’s a problem here. The result of the join resulted in more rows than is in our original radon dataset:

nrow(radon)
## [1] 12777
nrow(radon_join)
## [1] 13143

This means that the rows of geo are not uniquely defined by ctfips and stfips. Let’s investigate:

geo %>%
  group_by(ctfips, stfips) %>%
  summarise(n = n()) %>%
  arrange(-n)
## Source: local data frame [3,111 x 3]
## Groups: ctfips [301]
## 
## # A tibble: 3,111 x 3
##    ctfips stfips     n
##     <int>  <int> <int>
##  1     31      8    12
##  2      5      8     7
##  3     59      8     6
##  4      1      8     3
##  5      5     51     3
##  6     15     51     3
##  7     21     25     3
##  8     37      6     3
##  9     59     51     3
## 10     83      6     3
## # ... with 3,101 more rows

There quite a few state/county combinations that have more than one record. Let’s see what’s different within each of these groups. Here, we check to see whether all values are equal within a county/state combination:

tmp <- geo %>%
  group_by(ctfips, stfips) %>%
  summarise_all(function(x) all(x == x[1]))
tmp
## Source: local data frame [3,111 x 7]
## Groups: ctfips [?]
## 
## # A tibble: 3,111 x 7
##    ctfips stfips    st   cty   lon   lat  Uppm
##     <int>  <int> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1      1      1  TRUE  TRUE  TRUE  TRUE  TRUE
##  2      1      4  TRUE  TRUE  TRUE  TRUE  TRUE
##  3      1      5  TRUE  TRUE  TRUE  TRUE  TRUE
##  4      1      6  TRUE  TRUE  TRUE  TRUE  TRUE
##  5      1      8  TRUE  TRUE FALSE FALSE  TRUE
##  6      1      9  TRUE  TRUE  TRUE  TRUE  TRUE
##  7      1     10  TRUE  TRUE  TRUE  TRUE  TRUE
##  8      1     11  TRUE  TRUE  TRUE  TRUE  TRUE
##  9      1     12  TRUE  TRUE  TRUE  TRUE  TRUE
## 10      1     13  TRUE  TRUE  TRUE  TRUE  TRUE
## # ... with 3,101 more rows

Looking at the first few rows, we only see the case of a latitude and longitude with varying values within a group. Let’s see if that’s consistent

sapply(tmp, all)
## ctfips stfips     st    cty    lon    lat   Uppm 
##   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE   TRUE

This tells us that within each state/county group, all variables except latitude and longitude are constant.

We can investigate to what extent the lat/lon variables vary within a group with the following:

tmp2 <- geo %>%
  group_by(ctfips, stfips) %>%
  summarise(
    dlat = diff(range(lat)),
    dlon = diff(range(lon))
  ) %>%
  filter(dlat > 0 | dlon > 0)

max(tmp2$dlat)
## [1] 1.415
max(tmp2$dlon)
## [1] 1.775

This translates to a difference of about 100 miles in both latitude and longitude (of course it varies depending on where the longitude is). Since we don’t really care about lat/lon and they are close enough within a group, we will simply roll them up into one by averaging:

geo2 <- geo %>%
  group_by(ctfips, stfips, st, cty, Uppm) %>%
  summarise(lon = mean(lon), lat = mean(lat))

Now we can try our join again:

radon <- left_join(radon, geo2, by = c(cntyfips = "ctfips", stfips = "stfips"))
radon
## # A tibble: 12,777 x 31
##    idnum state state2 stfips   zip region typebldg floor  room basement
##    <int> <chr>  <chr>  <int> <int>  <int>    <chr> <chr> <chr>    <chr>
##  1     1    AZ     AZ      4 85920      1        1     1     2        N
##  2     2    AZ     AZ      4 85920      1        0     9     0     <NA>
##  3     3    AZ     AZ      4 85924      1        1     1     3        N
##  4     4    AZ     AZ      4 85925      1        1     1     3        N
##  5     5    AZ     AZ      4 85932      1        1     1     1        N
##  6     6    AZ     AZ      4 85936      1        1     1     1        N
##  7     7    AZ     AZ      4 85936      1        0     9     0     <NA>
##  8     8    AZ     AZ      4 85936      1        1     1     1        N
##  9     9    AZ     AZ      4 85938      1        0     9     0     <NA>
## 10    10    AZ     AZ      4 85938      1        1     1     3     <NA>
## # ... with 12,767 more rows, and 21 more variables: windoor <chr>,
## #   rep <int>, stratum <int>, wave <int>, starttm <chr>, stoptm <chr>,
## #   startdt <chr>, stopdt <chr>, activity <dbl>, pcterr <dbl>,
## #   adjwt <dbl>, dupflag <int>, zipflag <int>, cntyfips <int>,
## #   county <chr>, log_activity <dbl>, st <chr>, cty <chr>, Uppm <dbl>,
## #   lon <dbl>, lat <dbl>

This looks good.

A Very Simple Model

Let’s consider a very simple linear model. We want to model the log_activity as a function of any of the covariates present in the data. Ideally, an analyst would make some more exploratory plots investigating individual relationships between log_activity and and the covariates. But suppose for the sake of D3M that the analyst either picks a specific model or is sent back a model result from TA2 and wants to evaluate the model result.

In this section we’ll cover just a couple of elementary classical diagnostic plots that might be made and diagnostic measures that might be looked at for one linear model result. This is written to hopefully be accessible to someone who is not very familiar with statistics, and since it’s being prepared hastily, it almost surely misses that mark.

Here we’ll use Barret’s package to simulate what the output received for a model might look like.

devtools::install_github("d3m-purdue/d3mLm")
library(d3mLm)
## 
## Attaching package: 'd3mLm'
## The following object is masked _by_ '.GlobalEnv':
## 
##     radon_mn

Suppose the model specified was simply log_activity vs. Uppm. In other words, if we let \(y_i\) be the \(i\)th observation of the log radon activity and \(x_i\) be the \(i\)th observation of Uppm then our model is defined by:

\[y_i = a + bx_i + \varepsilon\]

Where \(a\) and \(b\) are the intercept and slope estimates, and \(\varepsilon_i\) is a random error term, and the \(\varepsilon_i\) are assumed to be independent and identically distributed (iid) following the normal distribution with a mean of 0 and variance of \(\sigma^2\).

We obtain the model fit with the following:

rdlm <- lm(log_activity ~ Uppm, data = radon, na.action = na.exclude)

(Note: if we don’t use na.exclude we won’t get residuals back with the same length as our input data as there are missing Uppm values…)

Using Barret’s package, we can get a mock-up of a TA2-type response for fitting such a model.

result <- extract_lm(rdlm)
## Warning in augment_columns(x, data, newdata, type.predict = type.predict, :
## When fitting with na.exclude, rows with NA in original data will be dropped
## unless those rows are provided in 'data' argument

We can make this look like a TA2 response by calling result$as_json(), but since this returns some pretty long vectors of diagnostic data, we’ll just access the object directly from R. From the R syntax, you should be able to deduce how the results would be obtained from the JSON output.

Interpreting the model

The model that we fit, \(y_i = a + b x_i + \varepsilon_i\) has a slope parameter \(a\) and an intercept parameter \(b\). Our model estimated \(a\) and \(b\), producing coefficients \(\hat{a}\) and $. Our fitted log radon activity values are:

\[\hat{y}_i = $\hat{a}$ + \hat{b}x\]

To interpret the model result, we can examine the coefficient estimates:

result$diag_coefs
##          term   estimate  std.error statistic       p.value
## 1 (Intercept)  1.5022975 0.04268792 35.192564 7.460569e-259
## 2        Uppm -0.1700675 0.02028583 -8.383559  5.678824e-17

The p.value field tells us whether these parameters are statistically significant. Here, statistically significant means that a coefficient is likely to be a meaningful addition to the model because changes in the the associated covariate’s value are related to changes in the response. A coefficient is generally considered to be statistically significant if the p-value is less than 0.05. Here, both the slope and intercept parameters of the model are very significant. The intercept coefficient estimate, \(\hat{a}\) tells us that we can expect a log radon activity of about 1.5 when the uranium level is 0. Also, for every unit increase in uranium level, we expect the log radon activity to change by -0.17 (the coefficient estimate \(\hat{b}\) for our covariate Uppm). This relationship will be more clear in the fitted model diagnostic plot.

Diagnostic metrics

First, let’s look at some of the model diagnostic metrics:

result$diag_model
##     r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.005559883   0.005480777 1.653113  70.28406 5.678824e-17  2 -24159.26
##        AIC      BIC deviance df.residual
## 1 48324.52 48346.83  34353.8       12571

The first metric, r.squared, tells us the proportion of the variance in our response variable log_activity that is explianed or predictable by our covariate, Uppm. A very good model will have a value much closer to 1. This is a very weak \(R^2\) value indicating a poor fit. Other metrics like AIC and BIC are used when comparing models with each other.

It will be interesting to think about the kind of classical model-specific ways of determining the appropriateness of models described here vs. more general approaches like cross-validation.

Diagnostic plots

One of the most basic diagnostic plots we can make is of the raw data with the fitted model superposed. This is very simple to do with one covariate and a simple linear model, but it of course gets a lot more complicated as the model gets more complex and more covariates are added.

Fitted model on raw data

Here we plot the raw data and use result$diag_coefs to superpose the fitted line:

coefs <- result$diag_coefs$estimate
ggplot(radon, aes(Uppm, log_activity)) +
  geom_point(alpha = 0.25) +
  geom_abline(intercept = coefs[1], slope = coefs[2],
    color = "steelblue", size = 2)
## Warning: Removed 204 rows containing missing values (geom_point).

From this plot we see that the fitted line intersects at 1.5 when Uppm = 0 and has a negative slope as we expect from looking at the coefficients. From looking at the plot, there is a large cloud of points and the points vary quite a bit around the line, showing us visually why this one variable, Uppm, does not explain much of what is happening with the radon activity.

Here’s what a plot might look like for (synthetic) data with the roughly the same model coefficients but a much better fit:

x <- runif(1000, 0, 4)
d <- data.frame(Uppm = x,
  log_activity = rnorm(1000, mean = coefs[1] + coefs[2] * x, sd = 0.2))
ggplot(d, aes(Uppm, log_activity)) +
  geom_point(alpha = 0.25) +
  geom_abline(
    intercept = result$diag_coefs$estimate[1],
    slope = result$diag_coefs$estimate[2],
    color = "steelblue", size = 2) +
  xlim(range(radon$Uppm)) +
  ylim(range(radon$log_activity))

When more covariates are used, the fitted model diagnostic plots get more complex.

Residuals plots

As we saw in the fitted model plot, our model does not capture everything that explains the radon activity. Our model has error, denoted by our error term \(\varepsilon\). Recall that a linear model assumes that its error terms are iid normally distributed with a mean of 0 and variance \(\sigma_2\). While we do not observe the error directly, we can estimate the error by calculating residuals, which are the fitted values, \(\hat{y}\) subtracted from the observed values:

\[e_i = y_i - \hat{y}_i\]

There are a lot of interesting plots that can be made with the residuals. Barret’s package returns the residuals with result$diag_data$.resid. We can try to join this back up with our original dataset for plotting:

radon$resid <- result$diag_data$.resid
## Error in `$<-.data.frame`(`*tmp*`, resid, value = c(-2.85416744799468, : replacement has 12573 rows, data has 12777

Note that since there are 4 missing values of Uppm in our data, and the residuals that are returned have omitted these missing values, we cannot join them… This is minor and we’ll work around this for now but it’s something to be aware of - we want to make sure any model diagnostic data returned from TA2 is able to be joined back up with our original data.

radon2 <- filter(radon, !is.na(Uppm))
radon2$resid <- result$diag_data$.resid

First, let’s plot the residuals vs. our predictor variable, Uppm:

ggplot(radon2, aes(Uppm, resid)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = FALSE)

The residuals in this plot should be randomly varying around zero. In the plot we also added a smoother to the scatterplot to show how the mean of the residuals is varying with respect to Uppm. If we have a good fit, we expect this smoother to stay at zero across all values of Uppm.

d$resid <- d$log_activity - coefs[1] - coefs[2] * d$Uppm
ggplot(d, aes(Uppm, resid)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = FALSE)

… more plots …

Summary

We have seen that this simple model does not do very well at explaining the log radon activity. Just because it doesn’t perform well doesn’t necessarily mean there are other models that do, but it is worth looking into other covariates or combinations of covariates that might improve things.

A more complex model

Include county (create a new unique county factor variable by combining the state and county codes) and floor (this variable will need to be cleaned up as well) in the model.

A multilevel model

Use the same variables as the previous model but fit a multilevel model, more along the lines of what Gelman does.

Exploratory Plots of Response vs. Covariates

We don’t know much about the variables in this data and how they relate to the response variable, so it is useful to make some exploratory visualizations investigating this. Ideally a domain expert would be very familiar with what each variable measures and what it means, but we don’t have that luxury with this data and it is actually very difficult to deduce what some of them mean and searching on the internet has not been very fruitful.

Here is a function that will create a boxplot of the log radon activity for each level of a categorical covariate:

## exploratory plots of response vs each categorical variable...
fct_boxplot <- function(dat, x_name, y_name) {
  radon[[x_name]] <- factor(radon[[x_name]])
  radon[[x_name]] <- forcats::fct_reorder(radon[[x_name]], radon[[y_name]])
  ggplot(radon, aes_string(x_name, y_name)) +
    geom_boxplot() +
    coord_flip()
}

Let’s use this to look at log radon activity for some of our categorical covariates.

state

fct_boxplot(radon, "state", "log_activity")

Radon levels definitely vary by state so it could be a useful covariate.

From some internet searching, it looks like the state of R5 corresponds to observations on indian reservations in different states. The source for this says that these should be omitted from an analysis.

state2

fct_boxplot(radon, "state2", "log_activity")

We see some new states. How is this different from state?

idx <- which(as.character(radon$state2) != as.character(radon$state))
table(paste(radon$state[idx], radon$state2[idx] , sep = "->"))
## 
## R5->MI R5->MN R5->WI 
##    204    269    461

It looks like state2 is state broken into MI, MN, and WI.

stfips

xtabs(~ state2 + stfips, data = radon)
##       stfips
## state2    4   18   25   26   27   29   38   42   55
##     AZ 1507    0    0    0    0    0    0    0    0
##     IN    0 1914    0    0    0    0    0    0    0
##     MA    0    0 1659    0    0    0    0    0    0
##     MI    0    0    0  204    0    0    0    0    0
##     MN    0    0    0    0 1188    0    0    0    0
##     MO    0    0    0    0    0 1859    0    0    0
##     ND    0    0    0    0    0    0 1596    0    0
##     PA    0    0    0    0    0    0    0 2389    0
##     WI    0    0    0    0    0    0    0    0  461

This looks equivalent to state so we can ignore it.

zip

length(unique(radon$zip))
## [1] 3195

There are too many zips to be a dummy regression variable.

region

fct_boxplot(radon, "region", "log_activity")

This variable doesn’t look that significant unless it has interactions with other variables…

typebldg

fct_boxplot(radon, "typebldg", "log_activity")

Might not be significant…

floor

fct_boxplot(radon, "floor", "log_activity")

room

fct_boxplot(radon, "room", "log_activity")

basement

fct_boxplot(radon, "basement", "log_activity")

windoor

table(radon$windoor, useNA = "always")
## 
##  <NA> 
## 12777

All missing - can ignore.

rep

table(radon$rep, useNA = "always")
## 
##    1    2    3    4    5 <NA> 
## 2341 2374 2360 2376 2392  934

This varible seems to represent the idea of repeated measures or something. It isn’t used in Gelman’s analysis so I don’t know if it is important.

Other variables

Here are the rest of the variables:

stratum, wave, starttm, stoptm, startdt, stopdt, activity, pcterr, adjwt, dupflag, zipflag, cntyfips, county, uid, st, cty, Uppm, lon, lat

Some of these names suggest clues as to what the variables are, but it is difficult to know what they mean and how they might be used in a model without knowing something more about the data…